{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Exponential Family"
   ]
  },
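  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many of the distributions we have seen so far, with the exception of the Gaussian mixture, belong to a single broad class called the exponential family.\n",
    "\n",
    "In general, for a random variable $\\mathbf x$ and parameters $\\mathbf \\eta$, a member of the exponential family has the form:\n",
    "\n",
    "$$\n",
    "p(\\mathbf x|\\mathbf \\eta)=h(\\mathbf x)g(\\mathbf \\eta)\\exp\\left\\{\\mathbf{\\eta^\\top} \\mathbf{u(x)}\\right\\}\n",
    "$$\n",
    "\n",
    "The random variable $\\mathbf x$ may be a scalar or a vector, and may be discrete or continuous. $\\mathbf \\eta$ is called the natural parameter of the distribution, and $\\mathbf{u(x)}$ is some function of $\\mathbf x$.\n",
    "\n",
    "$g(\\mathbf \\eta)$ can be viewed as a normalization coefficient that ensures the distribution is normalized. In the continuous case:\n",
    "\n",
    "$$\n",
    "g(\\mathbf \\eta)\\int h(\\mathbf x)\\exp\\left\\{\\mathbf{\\eta^\\top} \\mathbf{u(x)}\\right\\} d\\mathbf x=1\n",
    "$$\n",
    "\n",
    "In the discrete case, the integral is replaced by a sum."
   ]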
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bernoulli distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Bernoulli distribution is:\n",
    "\n",
    "$$\n",
    "p(x|\\mu) = {\\rm Bern}(x|\\mu) = \\mu^x(1-\\mu)^{1-x}\n",
    "$$\n",
    "\n",
    "Writing the right-hand side as the exponential of its logarithm, we have\n",
    "\n",
    "$$\n",
    "p(x|\\mu) = \\exp\\{x\\ln\\mu+(1-x)\\ln(1-\\mu)\\} \n",
    "= (1-\\mu)\\exp\\left\\{\\ln\\left(\\frac{\\mu}{1-\\mu}\\right)\\cdot x\\right\\}\n",
    "$$\n",
    "\n",
    "Comparing with the exponential-family form, we identify:\n",
    "\n",
    "$$\n",
    "\\eta=\\ln\\left(\\frac{\\mu}{1-\\mu}\\right)\n",
    "$$\n",
    "\n",
    "Solving for $\\mu$ gives\n",
    "\n",
    "$$\n",
    "\\mu = \\sigma(\\eta) = \\frac{1}{1+\\exp(-\\eta)}\n",
    "$$\n",
    "\n",
    "which is the familiar logistic sigmoid function. The Bernoulli distribution can therefore be written in the standard exponential-family form:\n",
    "\n",
    "$$\n",
    "p(x|\\eta) = (1-\\sigma(\\eta))\\exp(\\eta x) = \\sigma(-\\eta)\\exp(\\eta x)\n",
    "$$\n",
    "\n",
    "The corresponding components are:\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "u(x) &= x\\\\\n",
    "h(x) &= 1\\\\\n",
    "g(\\eta) &= \\sigma(-\\eta)\n",
    "\\end{align}\n",
    "$$"
   ]
  },
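  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick numerical sanity check of the form above (the value $\\mu = 0.8$ is an illustrative choice made here, not taken from the derivation):\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "\\eta &= \\ln\\left(\\frac{0.8}{0.2}\\right) = \\ln 4 \\approx 1.386\\\\\n",
    "g(\\eta) &= \\sigma(-\\eta) = \\frac{1}{1+\\exp(\\eta)} = \\frac{1}{1+4} = 0.2 = 1-\\mu\\\\\n",
    "p(x=1|\\eta) &= \\sigma(-\\eta)\\exp(\\eta) = 0.2 \\times 4 = 0.8 = \\mu\\\\\n",
    "p(x=0|\\eta) &= \\sigma(-\\eta)\\exp(0) = 0.2 = 1-\\mu\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "The two probabilities sum to one, as the normalization condition requires."
   ]
  },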
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multinomial distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Consider the multinomial distribution for a single observation:\n",
    "\n",
    "$$\n",
    "p(\\mathbf x|\\mathbf \\mu) = \\prod_{k=1}^M \\mu_k^{x_k} = \\exp\\left\\{\\sum_{k=1}^M x_k\\ln\\mu_k\\right\\}\n",
    "$$\n",
    "\n",
    "where $\\mathbf x = (x_1,\\dots,x_M)^\\top$.\n",
    "\n",
    "Defining $\\eta_k = \\ln\\mu_k$ and $\\mathbf\\eta=(\\eta_1,\\dots,\\eta_M)^\\top$, we have:\n",
    "\n",
    "$$\n",
    "p(\\mathbf x|\\mathbf \\eta) = \\exp(\\mathbf\\eta^\\top \\mathbf x) \n",
    "$$\n",
    "\n",
    "The corresponding components are:\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "\\mathbf{u(x)} &= \\mathbf x\\\\\n",
    "h(\\mathbf x) &= 1\\\\\n",
    "g(\\mathbf \\eta) &= 1\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "However, because of the constraint $\\sum_{k=1}^M \\mu_k= 1$, only $M-1$ of these parameters are independent.\n",
    "\n",
    "We eliminate $\\mu_M$ using $\\mu_M= 1-\\sum_{k=1}^{M-1}\\mu_k$, noting that the remaining parameters satisfy the constraints:\n",
    "\n",
    "$$\n",
    "0\\leq\\mu_k\\leq 1, \\quad \\sum_{k=1}^{M-1}\\mu_k \\leq 1\n",
    "$$\n",
    "\n",
    "Since exactly one component of $\\mathbf x$ equals 1, we also have $x_M = 1-\\sum_{k=1}^{M-1}x_k$, so that\n",
    "\n",
    "$$\n",
    "\\exp\\left\\{\\sum_{k=1}^M x_k\\ln\\mu_k\\right\\} \n",
    "= \\exp\\left\\{\\sum_{k=1}^{M-1} x_k\\ln\\mu_k + \\left(1-\\sum_{k=1}^{M-1}x_k\\right) \\ln\\left(1-\\sum_{k=1}^{M-1}\\mu_k\\right)\\right\\}\n",
    "= \\exp\\left\\{\\sum_{k=1}^{M-1} x_k\\ln\\left(\\frac{\\mu_k}{1-\\sum_{j=1}^{M-1}\\mu_j}\\right) + \\ln\\left(1-\\sum_{k=1}^{M-1}\\mu_k\\right)\\right\\}\n",
    "$$\n",
    "\n",
    "We now define:\n",
    "\n",
    "$$\n",
    "\\eta_k = \\ln\\left(\\frac{\\mu_k}{1-\\sum_{j=1}^{M-1}\\mu_j}\\right)\n",
    "$$\n",
    "\n",
    "Solving for $\\mu_k$ gives\n",
    "\n",
    "$$\n",
    "\\mu_k = \\frac{\\exp(\\eta_k)}{1+\\sum_{j=1}^{M-1}\\exp(\\eta_j)}\n",
    "$$\n",
    "\n",
    "which is the familiar `softmax` function.\n",
    "\n",
    "Therefore\n",
    "\n",
    "$$\n",
    "p(\\mathbf x|\\mathbf\\eta) = \\left(1+\\sum_{k=1}^{M-1} \\exp(\\eta_k)\\right)^{-1} \\exp(\\mathbf\\eta^\\top \\mathbf x)\n",
    "$$\n",
    "\n",
    "The corresponding components are:\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "\\mathbf{u(x)} &= \\mathbf x\\\\\n",
    "h(\\mathbf x) &= 1\\\\\n",
    "g(\\mathbf \\eta) &=  \\left(1+\\sum_{k=1}^{M-1} \\exp(\\eta_k)\\right)^{-1}\n",
    "\\end{align}\n",
    "$$"
   ]
  },
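  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to check the reduced parameterization is to specialize it to $M=2$, where it should recover the Bernoulli case. The following is a short worked check under that assumption, not part of the original derivation:\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "\\eta_1 &= \\ln\\left(\\frac{\\mu_1}{1-\\mu_1}\\right)\\\\\n",
    "\\mu_1 &= \\frac{\\exp(\\eta_1)}{1+\\exp(\\eta_1)} = \\sigma(\\eta_1)\\\\\n",
    "g(\\eta_1) &= \\left(1+\\exp(\\eta_1)\\right)^{-1} = \\sigma(-\\eta_1)\\\\\n",
    "p(x_1|\\eta_1) &= \\sigma(-\\eta_1)\\exp(\\eta_1 x_1)\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "This matches the Bernoulli form with $x = x_1$ and $\\mu = \\mu_1$, so the `softmax` function reduces to the logistic sigmoid when there are only two states."
   ]
  },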
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gaussian distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The univariate Gaussian distribution is\n",
    "\n",
    "$$\n",
    "p(x|\\mu,\\sigma^2)\n",
    "= \\frac{1}{(2\\pi\\sigma^2)^{1/2}} \\exp\\left\\{-\\frac{1}{2\\sigma^2}(x-\\mu)^2\\right\\}\n",
    "= \\frac{1}{(2\\pi\\sigma^2)^{1/2}} \\exp\\left\\{-\\frac{1}{2\\sigma^2}x^2+\\frac{\\mu}{\\sigma^2}x-\\frac{1}{2\\sigma^2}\\mu^2\\right\\}\n",
    "$$\n",
    "\n",
    "The corresponding components are:\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "\\mathbf \\eta & = \\begin{pmatrix}\\mu/\\sigma^2 \\\\ -1/(2\\sigma^2)\\end{pmatrix} \\\\\n",
    "\\mathbf u(x) &= \\begin{pmatrix}x \\\\ x^2\\end{pmatrix}\\\\\n",
    "h(x) &= (2\\pi)^{-1/2}\\\\\n",
    "g(\\mathbf \\eta) &= (-2\\eta_2)^{1/2} \\exp\\left(\\frac{\\eta_1^2}{4\\eta_2}\\right)\n",
    "\\end{align}\n",
    "$$"
   ]
  },
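  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It may help to confirm that these components reproduce the Gaussian density. Substituting $\\eta_1 = \\mu/\\sigma^2$ and $\\eta_2 = -1/(2\\sigma^2)$ into $g(\\mathbf\\eta)$ gives the following worked check (added here as a verification sketch):\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "(-2\\eta_2)^{1/2} &= \\left(\\frac{1}{\\sigma^2}\\right)^{1/2} = \\frac{1}{\\sigma}\\\\\n",
    "\\frac{\\eta_1^2}{4\\eta_2} &= \\frac{\\mu^2}{\\sigma^4}\\cdot\\left(-\\frac{\\sigma^2}{2}\\right) = -\\frac{\\mu^2}{2\\sigma^2}\\\\\n",
    "h(x)g(\\mathbf\\eta)\\exp\\left\\{\\mathbf\\eta^\\top\\mathbf u(x)\\right\\} &= \\frac{1}{(2\\pi)^{1/2}}\\cdot\\frac{1}{\\sigma}\\exp\\left(-\\frac{\\mu^2}{2\\sigma^2}\\right)\\exp\\left\\{\\frac{\\mu}{\\sigma^2}x-\\frac{1}{2\\sigma^2}x^2\\right\\}\\\\\n",
    "&= \\frac{1}{(2\\pi\\sigma^2)^{1/2}}\\exp\\left\\{-\\frac{1}{2\\sigma^2}(x-\\mu)^2\\right\\}\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "Note that $\\eta_2 < 0$ is required for $g(\\mathbf\\eta)$ to be real, which corresponds to $\\sigma^2 > 0$."
   ]
  },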
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.4.1 Maximum likelihood and sufficient statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Take the gradient with respect to $\\mathbf \\eta$ of both sides of the normalization condition:\n",
    "\n",
    "$$\n",
    "g(\\mathbf \\eta)\\int h(\\mathbf x)\\exp\\left\\{\\mathbf{\\eta^\\top} \\mathbf{u(x)}\\right\\} d\\mathbf x=1\n",
    "$$\n",
    "\n",
    "This gives\n",
    "\n",
    "$$\n",
    "\\nabla g(\\mathbf\\eta)\\int h(\\mathbf x)\\exp\\left\\{\\mathbf{\\eta^\\top} \\mathbf{u(x)}\\right\\} d\\mathbf x + g(\\mathbf \\eta)\\int h(\\mathbf x)\\exp\\left\\{\\mathbf{\\eta^\\top} \\mathbf{u(x)}\\right\\} \\mathbf{u(x)}d\\mathbf x= 0\n",
    "$$\n",
    "\n",
    "Combining this with the original identity, in which the first integral equals $1/g(\\mathbf\\eta)$, we have:\n",
    "\n",
    "$$\n",
    "-\\frac{1}{g(\\mathbf\\eta)}\\nabla g(\\mathbf\\eta) = g(\\mathbf \\eta)\\int h(\\mathbf x)\\exp\\left\\{\\mathbf{\\eta^\\top} \\mathbf{u(x)}\\right\\} \\mathbf{u(x)}d\\mathbf x = \\mathbb E[\\mathbf{u(x)}]\n",
    "$$\n",
    "\n",
    "and therefore\n",
    "\n",
    "$$\n",
    "- \\nabla \\ln g(\\mathbf\\eta) = \\mathbb E[\\mathbf{u(x)}]\n",
    "$$\n",
    "\n",
    "Taking the gradient with respect to $\\mathbf \\eta$ once more gives:\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "- \\nabla \\nabla \\ln g(\\mathbf\\eta) = & \n",
    "g(\\mathbf \\eta)\\int h(\\mathbf x)\\exp\\left\\{\\mathbf{\\eta^\\top} \\mathbf{u(x)} \\right\\} \\mathbf{u(x)}\\mathbf{u(x)}^\\top d\\mathbf x \\\\ & + \\nabla g(\\mathbf \\eta)\\int h(\\mathbf x)\\exp\\left\\{\\mathbf{\\eta^\\top} \\mathbf{u(x)}\\right\\} \\mathbf{u(x)}^\\top d\\mathbf x\\\\\n",
    "= & \\mathbb E[\\mathbf{u(x)u(x)^\\top}] - \\mathbb E[\\mathbf{u(x)}]\\mathbb E[\\mathbf{u(x)}^\\top] \\\\\n",
    "= & \\mathrm{cov}[\\mathbf{u(x)}]\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "This gives the covariance matrix of $\\mathbf{u(x)}$.\n",
    "\n",
    "With this result in hand, we turn to maximum likelihood estimation. Let the data set be $\\mathbf X=\\{\\mathbf x_1, \\dots ,\\mathbf x_N\\}$, with the points assumed to be independent and identically distributed.\n",
    "\n",
    "The likelihood function is:\n",
    "\n",
    "$$\n",
    "p({\\bf X|\\eta}) = \\left(\\prod_{n=1}^N h(\\mathbf x_n)\\right) g(\\mathbf \\eta)^N \\exp\\left\\{\\mathbf{\\eta}^\\top\\sum_{n=1}^N \\mathbf{u}(\\mathbf{x}_n)\\right\\}\n",
    "$$\n",
    "\n",
    "The log-likelihood is:\n",
    "\n",
    "$$\n",
    "\\ln p({\\bf X|\\eta}) = \\sum_{n=1}^N \\ln h(\\mathbf x_n) + N \\ln g(\\mathbf \\eta) + \\mathbf{\\eta}^\\top\\sum_{n=1}^N \\mathbf{u}(\\mathbf{x}_n)\n",
    "$$\n",
    "\n",
    "Setting its gradient with respect to $\\mathbf \\eta$ to zero gives\n",
    "\n",
    "$$\n",
    "- \\nabla \\ln g(\\mathbf\\eta_{\\rm ML}) = \\frac{1}{N} \\sum_{n=1}^N \\mathbf{u}(\\mathbf{x}_n)\n",
    "$$\n",
    "\n",
    "As $N\\to\\infty$, the right-hand side converges to the expectation $\\mathbb E[\\mathbf{u(x)}]$, so $\\mathbf\\eta_{\\rm ML}$ approaches the true value of $\\mathbf\\eta$.\n",
    "\n",
    "We see that the estimate of $\\mathbf \\eta$ depends on the data only through $\\sum_{n} \\mathbf{u}(\\mathbf{x}_n)$, which is therefore a sufficient statistic (`sufficient statistic`) of the distribution. This means we need to store only this sufficient statistic rather than the whole data set.\n",
    "\n",
    "For example, for the Bernoulli distribution ($u(x)=x$) the sufficient statistic is $\\sum_{n} x_n$, and for the Gaussian distribution ($\\mathbf u(x)=\\begin{bmatrix} x \\\\ x^2\\end{bmatrix}$) it is $\\begin{bmatrix} \\sum_{n} x_n \\\\ \\sum_{n} x_n^2\\end{bmatrix}$."
   ]
  },
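  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make these identities concrete, here is a worked check for the Bernoulli distribution, using $g(\\eta)=\\sigma(-\\eta)$ and $u(x)=x$ from the earlier section. It is offered as an illustrative sketch; the scalar derivative $d/d\\eta$ plays the role of the gradient.\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "-\\frac{d}{d\\eta}\\ln g(\\eta) &= \\frac{d}{d\\eta}\\ln\\left(1+\\exp(\\eta)\\right) = \\frac{\\exp(\\eta)}{1+\\exp(\\eta)} = \\sigma(\\eta) = \\mu = \\mathbb E[x]\\\\\n",
    "-\\frac{d^2}{d\\eta^2}\\ln g(\\eta) &= \\sigma(\\eta)\\left(1-\\sigma(\\eta)\\right) = \\mu(1-\\mu) = \\mathrm{var}[x]\\\\\n",
    "\\sigma(\\eta_{\\rm ML}) &= \\frac{1}{N}\\sum_{n=1}^N x_n\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "The last line is the maximum likelihood condition: since $\\mu=\\sigma(\\eta)$, the estimate of $\\mu$ is simply the sample mean, which depends on the data only through $\\sum_n x_n$."
   ]
  },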
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 2.4.2 Conjugate priors"
   ]
  },
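  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "For the exponential family, the form of the likelihood function suggests the following conjugate prior:\n",
    "\n",
    "$$\n",
    "p(\\mathbf\\eta|\\mathbf\\chi,\\nu)=f(\\mathbf\\chi,\\nu)g(\\mathbf\\eta)^{\\nu} \\exp\\left\\{\\nu\\mathbf{\\eta^\\top \\chi}\\right\\}\n",
    "$$\n",
    "\n",
    "where $f(\\mathbf\\chi,\\nu)$ is a normalization coefficient, $\\mathbf{\\chi}$ is a vector, and $\\nu$ is a scalar.\n",
    "\n",
    "Multiplying this prior by the likelihood function, the posterior distribution is:\n",
    "\n",
    "$$\n",
    "p(\\mathbf\\eta|\\mathbf{X, \\chi},\\nu) \\propto g(\\mathbf\\eta)^{\\nu + N} \\exp\\left\\{\\mathbf{\\eta}^\\top \\left(\\sum_{n=1}^N \\mathbf u(\\mathbf x_n) + \\nu\\mathbf\\chi\\right)\\right\\}\n",
    "$$\n",
    "\n",
    "which has the same functional form as the prior. The parameter $\\nu$ can be interpreted as the effective number of pseudo-observations in the prior, each of which contributes the value $\\mathbf \\chi$ to the sufficient statistic."
   ]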
  },
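  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an illustration of this general form, consider the Bernoulli case with $g(\\eta)=\\sigma(-\\eta)=1-\\mu$ and $\\exp(\\eta)=\\mu/(1-\\mu)$. The sketch below changes variables from $\\eta$ to $\\mu$, using $d\\eta/d\\mu = 1/\\left(\\mu(1-\\mu)\\right)$; the shape parameters $a$ and $b$ of the beta distribution are notation introduced here for the example.\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "p(\\eta|\\chi,\\nu) &\\propto (1-\\mu)^{\\nu}\\left(\\frac{\\mu}{1-\\mu}\\right)^{\\nu\\chi} = \\mu^{\\nu\\chi}(1-\\mu)^{\\nu(1-\\chi)}\\\\\n",
    "p(\\mu|\\chi,\\nu) &= p(\\eta|\\chi,\\nu)\\left|\\frac{d\\eta}{d\\mu}\\right| \\propto \\mu^{\\nu\\chi-1}(1-\\mu)^{\\nu(1-\\chi)-1}\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "This is a beta distribution with $a=\\nu\\chi$ and $b=\\nu(1-\\chi)$. Multiplying by the Bernoulli likelihood $\\mu^{\\sum_n x_n}(1-\\mu)^{N-\\sum_n x_n}$ gives a beta posterior with $a=\\nu\\chi+\\sum_n x_n$ and $b=\\nu(1-\\chi)+N-\\sum_n x_n$, which is consistent with reading $\\nu$ as a number of pseudo-observations with mean value $\\chi$."
   ]
  },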
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 2.4.3 Noninformative priors"
   ]
  },
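  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we have no information about what form the prior should take, we can use a noninformative prior (`noninformative prior`).\n",
    "\n",
    "For a distribution $p(x|\\lambda)$, the simplest choice is a prior that assigns equal probability to every value of the parameter $\\lambda$. If $\\lambda$ is discrete, this is simply a uniform distribution over its values. For a continuous parameter, however, this approach runs into two problems.\n",
    "\n",
    "The first problem is that if the domain of $\\lambda$ is unbounded, the prior cannot be normalized because its integral diverges. Such a prior is called an improper prior (`improper prior`). In practice, an improper prior can be used as long as the resulting posterior is proper. For example, if we place a uniform prior on the mean of a Gaussian, the posterior over the mean is proper once we have observed at least one data point.\n",
    "\n",
    "The second problem concerns how probability densities behave under a nonlinear change of variables. If a function $h(\\lambda)$ is constant, then under the change of variables $\\lambda = \\eta^2$ the function $\\hat h(\\eta) = h(\\eta^2)$ is also constant. For a probability density, however, if $p_\\lambda(\\lambda)$ is constant, then:\n",
    "\n",
    "$$\n",
    "p_\\eta(\\eta)=p_\\lambda(\\lambda)\\left|\\frac{d\\lambda}{d\\eta}\\right| = p_\\lambda(\\eta^2) 2\\eta \\propto \\eta\n",
    "$$\n",
    "\n",
    "which is no longer constant."
   ]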
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Location and scale parameters"
   ]
  },
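  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We consider two examples of noninformative priors: one for a location parameter (`location parameter`) and one for a scale parameter (`scale parameter`).\n",
    "\n",
    "A location parameter $\\mu$ is one for which the density takes the form:\n",
    "\n",
    "$$\n",
    "p(x|\\mu) = f(x-\\mu)\n",
    "$$\n",
    "\n",
    "Under the shift $\\bar x = x+c$ with $\\bar\\mu = \\mu+c$, the density keeps the same form, $p(\\bar x|\\bar\\mu) = f(\\bar x-\\bar\\mu)$.\n",
    "\n",
    "We want a prior that reflects this translation invariance, that is, one that assigns the same probability mass to the interval $A \\leq \\mu \\leq B$ as to the shifted interval $A-c \\leq \\mu \\leq B-c$:\n",
    "\n",
    "$$\n",
    "\\int_{A}^B p(\\mu) d\\mu = \\int_{A-c}^{B-c} p(\\mu) d\\mu = \\int_{A}^B p(\\mu-c) d\\mu\n",
    "$$\n",
    "\n",
    "Since this holds for all $A$ and $B$, we must have:\n",
    "\n",
    "$$\n",
    "p(\\mu) = p(\\mu-c)\n",
    "$$\n",
    "\n",
    "for every shift $c$, so $p(\\mu)$ is constant.\n",
    "\n",
    "For the Gaussian distribution, $\\mu$ is a location parameter and its conjugate prior is $\\mathcal N(\\mu_0,\\sigma_0^2)$. The noninformative prior is obtained in the limit $\\sigma_0^2 \\to \\infty$.\n",
    "\n",
    "A scale parameter $\\sigma$ is one for which the density takes the form\n",
    "\n",
    "$$\n",
    "p(x|\\sigma) = \\frac{1}{\\sigma} f\\left(\\frac{x}{\\sigma}\\right)\n",
    "$$\n",
    "\n",
    "Under the scaling $\\bar x = cx$ with $\\bar\\sigma = c\\sigma$, the density keeps the same form, $p(\\bar x|\\bar\\sigma) = \\frac{1}{\\bar\\sigma} f\\left(\\frac{\\bar x}{\\bar\\sigma}\\right)$.\n",
    "\n",
    "To reflect this scale invariance, the prior should assign the same probability mass to the interval $A \\leq \\sigma \\leq B$ as to the scaled interval $A/c \\leq \\sigma \\leq B/c$:\n",
    "\n",
    "$$\n",
    "\\int_{A}^{B} p(\\sigma) d\\sigma = \\int_{A/c}^{B/c} p(\\sigma) d\\sigma =\\int_{A}^{B} p\\left(\\frac{1}{c}\\sigma\\right) \\frac{1}{c} d\\sigma \n",
    "$$\n",
    "\n",
    "Since this holds for all $A$ and $B$, we must have\n",
    "\n",
    "$$\n",
    " p(\\sigma) = p\\left(\\frac{1}{c}\\sigma\\right) \\frac{1}{c}\n",
    "$$\n",
    "\n",
    "and hence $ p(\\sigma) \\propto 1/\\sigma$. This is again an improper prior, because its integral over $0\\leq\\sigma\\leq\\infty$ diverges, although its integral over any bounded interval $0 < A \\leq \\sigma \\leq B < \\infty$ is finite.\n",
    "\n",
    "For the Gaussian distribution, $\\sigma$ is a scale parameter. In terms of the precision $\\lambda=\\frac{1}{\\sigma^2}$, so that $\\sigma = \\lambda^{-1/2}$, the change of variables gives $p(\\lambda) \\propto p(\\sigma)\\left|\\frac{d\\sigma}{d\\lambda}\\right| \\propto \\lambda^{1/2} \\lambda^{-3/2} = \\frac{1}{\\lambda}$, which corresponds to the conjugate gamma prior with parameters $a_0=b_0=0$."
   ]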
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
